We show how the execution time of algorithms on quantum computers depends on the architecture
of the quantum computer, the choice of algorithms (including subroutines such as arithmetic), and
the "clock speed" of the quantum computer. The primary architectural features of interest are the
ability to execute multiple gates concurrently, the number of application-level qubits available, and
the interconnection network of qubits. We analyze Shor's algorithm for factoring large numbers in
this context. Our results show that, if arbitrary interconnection of qubits is possible, a machine with
an application-level clock speed of as low as one-third of a (possibly encoded) gate per second could
factor a 576-bit number in under one month, potentially outperforming a large network of classical
computers. For nearest-neighbor-only architectures, a clock speed of around twenty-seven gates per
second is required.



1. Introduction



Quantum computers are currently being designed that will take advantage of quantum
mechanical effects to perform certain computations much faster than can be achieved using
current ("classical") computers. Many technological approaches have been proposed,
some of which are being investigated experimentally. DiVincenzo proposed five criteria
which must be met by any useful quantum computing technology. In addition to these
criteria, a useful quantum computing technology must also support a quantum computer
system architecture which can run one or more quantum algorithms in a usefully short
time. This observation subsumes into one requirement several issues which, while not
strictly necessary to build a quantum computer, will have a strong impact on the possibility
of engineering a practical system. These include the importance of gate "clock" speed,
support for concurrent gate operations, the total number of application-level qubits
supportable, and the complexities of the qubit interconnect network.

This paper discusses the impact of these architectural elements on algorithm execution
time using the example of Shor's algorithm for factoring large numbers. Shor's algorithm
ignited much of the current interest in quantum computing because of the improvement in
computational class it appears to offer on this important problem. Using Shor's algorithm, a
quantum computer can solve the problem in polynomial time, for a superpolynomial
speed-up. Shor's algorithm is theoretically important, well defined, and utilizes building blocks
(arithmetic, the quantum Fourier transform) with broad applicability, making it ideal for
our analysis.

Figure 1. Scaling of number field sieve (NFS) on classical computers and Shor's algorithm for factoring on a
quantum computer, using BCDP modular exponentiation with various clock rates. Both horizontal and vertical
axes are log scale. The horizontal axis is the size of the number being factored.

On a classical computer, or a collection thereof, the time and computing resources to
factor a large number, using the fastest known algorithm, scale superpolynomially in the
length of the number (in decimal digits or bits). This algorithm is the generalized number
field sieve (NFS). Its asymptotic computational complexity on large numbers is

$$O\left(e^{(nk\log^2 n)^{1/3}}\right) \qquad (1)$$

where $n$ is the length of the number, in bits, and $k = \frac{64}{9}\log 2$. The comparable
computational complexity to factor a number $N$ using Shor's algorithm is dominated by the time to
exponentiate a randomly chosen number $x$, modulo $N$, for a superposition of all possible
exponents. Therefore, efficient arithmetic algorithms for calculating modular exponentiation
in the quantum domain are critical.
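
To make the superpolynomial growth of Eq. (1) concrete, the following minimal Python sketch extrapolates NFS run time from a single reference point. It assumes that Eq. (1) reads $e^{(nk\log^2 n)^{1/3}}$ with $k = \frac{64}{9}\log 2$, that the logarithms are natural, and that lower-order terms can be ignored. The function names and the default reference point (a 530-bit number in one month, the record quoted for Figure 1) are illustrative choices, not part of the original analysis.

```python
import math


def nfs_work(n_bits: int) -> float:
    """Relative NFS cost exp((n * k * log^2 n)^(1/3)), natural logs assumed."""
    k = (64.0 / 9.0) * math.log(2.0)
    return math.exp((n_bits * k * math.log(n_bits) ** 2) ** (1.0 / 3.0))


def nfs_months(n_bits: int, ref_bits: int = 530, ref_months: float = 1.0) -> float:
    """Extrapolated run time on fixed hardware, scaled from one reference point."""
    return ref_months * nfs_work(n_bits) / nfs_work(ref_bits)


if __name__ == "__main__":
    for bits in (530, 576, 768, 1024):
        print(f"{bits:5d} bits: {nfs_months(bits):.3g} months on the reference hardware")
```

Because only the ratio of two evaluations is used, the unknown constant hidden by the $O(\cdot)$ notation cancels; the sketch shows the shape of the curve, not an absolute prediction.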

Clock speed and other architectural features of quantum computing devices are very often
ignored, on the assumption that the superpolynomial speed-up will dominate, making the
algorithm practical on any experimentally realizable quantum computer. Shor's algorithm
runs in polynomial time, but the details of the polynomial matter: what degree is the
polynomial, and what are the constant factors?

An immediate comparison of the execution time to factor a number on classical and
quantum computers is shown in Figure 1. The performance of Shor's algorithm on a
quantum computer using the Beckman-Chari-Devabhaktuni-Preskill (BCDP) modular
exponentiation algorithm is compared to classical computers running the general Number Field
Sieve (NFS). The steep curves are for NFS on a set of classical computers. The left curve
is extrapolated performance based on a previous world record, factoring a 530-bit number
in one month, established using 104 PCs and workstations made in 2003. The right curve
is speculative performance using 1,000 times as much computing power. This could be
100,000 PCs in 2003, or, based on Moore's law, 100 PCs in 2018. From these curves it is
easy to see that Moore's law has only a modest effect on our ability to factor large numbers.
The shallower curves on the figure are predictions of the performance of a quantum
computer running Shor's algorithm, using the BCDP modular exponentiation routine, which
uses $5n$ qubits to factor an $n$-bit number, requiring $\sim 54n^3$ gate times to run the algorithm
on large numbers. The four curves are for different clock rates from 1 Hz to 1 GHz. The
performance scales linearly with clock speed. Factoring a 576-bit number in one month
of calendar time requires a clock rate of 4 kHz. A 1 MHz clock will solve the problem in
about three hours. If the clock rate is only 1 Hz, the same factoring problem will take more
than three hundred years.
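
The figures quoted above follow from simple arithmetic, which the Python sketch below reproduces. It assumes the BCDP latency is read as roughly $54n^3$ gate times, a 30-day month, and a 365.25-day year; the function and constant names are hypothetical and chosen only for this illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600       # assumption: a 30-day month
SECONDS_PER_YEAR = 365.25 * 24 * 3600    # assumption: a 365.25-day year


def bcdp_gate_times(n_bits: int) -> int:
    """Approximate latency of Shor's algorithm with BCDP, in gate times (~54 n^3)."""
    return 54 * n_bits ** 3


def run_time_seconds(n_bits: int, clock_hz: float) -> float:
    """Wall-clock time at a given application-level clock rate."""
    return bcdp_gate_times(n_bits) / clock_hz


def required_clock_hz(n_bits: int, deadline_seconds: float) -> float:
    """Clock rate needed to finish within the deadline."""
    return bcdp_gate_times(n_bits) / deadline_seconds


if __name__ == "__main__":
    n = 576
    print(f"clock for one month: {required_clock_hz(n, SECONDS_PER_MONTH):.0f} Hz")
    print(f"time at 1 MHz:       {run_time_seconds(n, 1e6) / 3600:.2f} hours")
    print(f"time at 1 Hz:        {run_time_seconds(n, 1.0) / SECONDS_PER_YEAR:.0f} years")
```

Under these assumptions the sketch gives a required clock just under 4 kHz, a little under three hours at 1 MHz, and more than three hundred years at 1 Hz, in line with the values read from Figure 1.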

The performance of the BCDP modular exponentiation algorithm is almost independent
of architecture. However, the performance of most polynomial-time algorithms varies
noticeably depending on the system architecture. The main objective of this paper is
to show how we can improve the execution time shown in Figure 1 by understanding the
relationship of architecture and algorithm.

2. Results 

We have analyzed two separate architectures, still technology independent but with some
important features that help us understand performance. The AC (abstract concurrent)
architecture is our abstract model, akin to what is commonly used when drawing quantum
circuits. It supports arbitrary concurrency and gate operands any distance apart without
penalty. The second architecture, NTC (neighbor-only, two-qubit gate, concurrent),
assumes the qubits are laid out in a one-dimensional line, and only neighboring qubits can
interact. This is a reasonable description of several important experimental approaches,
including a one-dimensional chain of quantum dots, the original Kane proposal, and
the all-silicon NMR device.

Above the architecture resides the choice of algorithm, especially for basic arithmetic
operations. The computational complexity of an algorithm can be calculated for total cost,
or for latency or circuit depth, if the dependencies of variables allow multiple parts of a
computation to be conducted concurrently. Fundamentally, the computational complexity
of quantum modular exponentiation is $O(n^3)$; that is, the execution cost grows as the
cube of the number of qubits. It consists of $2n$ modular multiplications of $n$-bit numbers,
each of which consists of $O(n)$ additions, each of which requires $O(n)$ operations.
However, $O(n^3)$ operations do not necessarily require $O(n^3)$ time steps; the circuit depth can
be made shallower than $O(n^3)$ by performing portions of the calculation concurrently.

On an abstract machine, we can reduce the running time of each of the three layers
(addition, multiplication, exponentiation) to $O(\log n)$ time steps by running some of the
gates in parallel, giving a total running time of $O(\log^3 n)$. This requires $O(n^3)$ qubits and
the ability to execute an arbitrary number of gates on separate qubits. Such large numbers of
qubits are not expected to be practical for the foreseeable future, so interesting engineering
lies in optimizing for a given set of architectural constraints.

Addition forms the basis of multiplication, and hence of exponentiation. Classically,
many forms of adders have been used in computer hardware. The most basic type of
adder, variants of which are used in both VBE and BCDP (as well as our algorithm F,
below), is the carry-ripple adder, in which the carry portion of the addition is done linearly
from the low-order bits to the high-order. This form of adder is $O(n)$ in both circuit depth
and complexity; it is the only efficient type for NTC linear architectures, in which the time
to propagate the low-order carry is inherently constrained to $O(n)$. When long-distance
gates are available, as in AC architectures, the use of faster adders such as conditional-sum,
carry-lookahead, or carry-save adders can result in $O(\log n)$ latency, though the complexity
remains $O(n)$ [15, 16, 17].
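
The linear carry chain of a carry-ripple adder can be illustrated with a purely classical Python sketch. This is not the VBE, BCDP, or CDKM quantum circuit, only a classical analogue showing why the depth is $O(n)$: each bit position must wait for the carry out of the position below it. All function names here are hypothetical.

```python
def to_bits(value: int, width: int) -> list[int]:
    """Little-endian bit list of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits: list[int]) -> int:
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits: list[int], b_bits: list[int]) -> tuple[list[int], int]:
    """Add two equal-width little-endian bit lists.

    Returns the (width + 1)-bit sum and the number of sequential carry steps,
    which equals the width because each step depends on the previous carry.
    """
    if len(a_bits) != len(b_bits):
        raise ValueError("operands must have the same width")
    carry = 0
    steps = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                  # sum bit
        carry = (a & b) | (carry & (a ^ b))        # carry into the next position
        steps += 1
    out.append(carry)
    return out, steps


if __name__ == "__main__":
    total, depth = ripple_carry_add(to_bits(13, 4), to_bits(11, 4))
    print(from_bits(total), depth)
```

The loop body cannot be parallelized as written, since `carry` is carried from one iteration to the next; the faster adders named above restructure exactly this dependency at the cost of longer-distance interactions.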

We have composed several algorithm variants, A through F, as well as investigated
concurrent and parallel versions of the original Vedral-Barenco-Ekert (VBE) and BCDP
algorithms; only the fastest for our AC and NTC architectures are presented here. Four
parameters control the behavior of the algorithm variants and how well they match a
particular architecture. These parameters include the choice of type of adder and the amount
of space required. Algorithm variant D is tuned for AC using the conditional-sum adder,
and F is tuned for NTC using the Cuccaro-Draper-Kutin-Moulton (CDKM) carry-ripple
adder. We have optimized the parameter settings for each individual data point, though
the differences are just barely visible on our log-log plot. The values reported here for both
algorithms are calculated using $2n^2$ qubits of storage to exponentiate an $n$-bit number, the
largest number of qubits our algorithms can effectively use. The primary characteristics of
the algorithms shown in Figure 2 are summarized in Table 1. The table lists the number
of multiplication units executing concurrently, the space, measured in number of logical
qubits, the concurrency, or number of logical operations taking place at the same time, and
the overall circuit depth, or time, measured in gates.



Table 1. Composition of our algorithms. 
Figure 2 shows our results for our faster algorithms. We have kept the 1 Hz and 1 MHz
lines for BCDP, and added matching lines for our fastest algorithms on the AC and NTC
architectures. For AC, our algorithm D requires a clock rate of only about 0.3 Hz to factor
the same 576-bit number in one month. For NTC, using our algorithm F, a clock rate of
around 27 Hz is necessary. The graph shows that, for problem sizes larger than 6,000 bits,
our algorithm D is one million times faster than the basic BCDP algorithm, and algorithm
F is one thousand times faster. For very large $n$, the latency of D is $\sim 9n\log_2^2(n)$. The
latency of F is $\sim 20n^2\log_2(n)$.
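
The speed-up factors quoted for large problems can be checked against the asymptotic latency expressions with a short Python sketch. It assumes the BCDP latency is about $54n^3$ gate times, that the latency of D is read as $9n\log_2^2(n)$ (the exponent on the logarithm is an assumption made here, chosen because it is the reading consistent with the quoted million-fold speed-up), and that the latency of F is $20n^2\log_2(n)$. The function names are hypothetical.

```python
import math


def latency_bcdp(n_bits: int) -> float:
    """Assumed BCDP latency in gate times: ~54 n^3."""
    return 54.0 * n_bits ** 3


def latency_d(n_bits: int) -> float:
    """Assumed asymptotic latency of algorithm D on AC: ~9 n log2(n)^2."""
    return 9.0 * n_bits * math.log2(n_bits) ** 2


def latency_f(n_bits: int) -> float:
    """Assumed asymptotic latency of algorithm F on NTC: ~20 n^2 log2(n)."""
    return 20.0 * n_bits ** 2 * math.log2(n_bits)


if __name__ == "__main__":
    n = 6000
    print(f"D speed-up over BCDP at {n} bits: {latency_bcdp(n) / latency_d(n):.3g}")
    print(f"F speed-up over BCDP at {n} bits: {latency_bcdp(n) / latency_f(n):.3g}")
```

At 6,000 bits these expressions give speed-ups on the order of one million for D and one thousand for F. Since the expressions are stated only for very large $n$, they should not be expected to reproduce the 576-bit clock rates exactly.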

This relationship of architecture and algorithm has obvious architectural implications: 
concurrency is critical, and support for long-distance gates is important. 
Figure 2. Scaling of number field sieve (NFS) and Shor's algorithms for factoring, using faster modular
exponentiation algorithms.



3. Discussion 

A fast clock speed is obviously also important for a fast algorithm; however, it remains an
open question whether those quantum computing technologies which feature naturally fast
physical quantum gates will have the fastest overall algorithm speed. All quantum
computing technologies feature some level of decoherence, requiring resources for quantum error
correction [19, 20]. For example, quantum computers based on Josephson junctions are
likely to have extremely fast single-qubit and two-qubit gates, with a physical clock rate at
the gigahertz level, as demonstrated in recent experiments [21]. However, the single-qubit
decoherence time is only about 1 μs for the most coherent superconducting qubits.
Although the gates are fast, the difficulty of long-term qubit storage and the resources needed for
fault-tolerant operation may be quite large, so these implementations might make excellent
processors with poor memories. In sharp contrast, NMR-based approaches are quite
slow, with nuclear-nuclear interactions in the kilohertz range. However, the much longer
coherence times of nuclei make the use of NMR-based qubits as memory substantially
easier. Ion trap implementations have the benefit of faster single-qubit-gate,
two-qubit-gate, and qubit-measurement speeds with longer coherence times, but the added
complication of moving ionic qubits from trap to trap physically or exchanging their values
optically complicates the picture for the application-level clock rate. New physical
proposals for overcoming speed and scalability obstacles continue to be developed, leaving the
ultimate hardware limitations on clock speed and its relation to algorithm execution time
uncertain.



4. Conclusions 

We have shown that the actual execution time of Shor's algorithm is dependent on the
important features of concurrent gate execution, available number of qubits, interconnect
topology, and clock speed, as well as the critical choice of an architecture-appropriate
arithmetic algorithm. Our algorithms have shown a speed-up factor ranging from nearly
13,000 for factoring a 576-bit number to one million for a 6,000-bit number.